Note: You should run this notebook on the Pierce Nupack cluster

Homework 4.5: Looking beyond sequence similarity (25 pts)¶


In order to run this homework problem, you'll first need to run a few commands in the shell to set up the environment. You will be using modern bioinformatics tools to complete this homework, including MMseqs2 (https://www.nature.com/articles/nbt.3988) and TRILL (https://www.biorxiv.org/content/10.1101/2023.10.24.563881v2). While the former is a specialized tool for comparing and clustering large amounts of sequence data, the latter is more of a toolbox that brings together useful protein analysis and design methods in a unified framework.

  1. $ "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
    • After running the previous line on the command line, press Enter 4 times to accept the default installation. You'll then need to restart your shell, which can be done by simply logging out of the cluster and back in.
  2. $ micromamba create -n TRILL python=3.10 ; micromamba activate TRILL
  3. $ micromamba install pytorch==1.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia -c conda-forge gcc gxx openmm swig pdbfixer ipykernel openbabel smina fpocket vina -c bioconda pyrsistent foldseek -c pyg pyg=2.3.1=py310_torch_1.13.0_cu117 pytorch-cluster=1.6.1=py310_torch_1.13.0_cu117 pytorch-sparse=0.6.17=py310_torch_1.13.0_cu117 pytorch-scatter=2.1.1=py310_torch_1.13.0_cu117
  4. $ micromamba install -c bioconda -c "dglteam/label/cu117" dgl
  5. $ pip install git+https://github.com/martinez-zacharya/lightdock.git@03a8bc4888c0ff8c98b7f0df4b3c671e3dbf3b1f
  6. $ pip install trill-proteins

Check to see if TRILL was successfully installed by running:

$ trill -h

Part 1¶

Below are examples of potential failure modes for homology detection. We have two example datasets, one focusing on viral homology and identifying microviral hallmark proteins (VP1s https://journals.asm.org/doi/10.1128/mbio.00588-22), while the other is for classifying proteins based on their ability to act as cell-penetrating peptides (CPPs https://www.biorxiv.org/content/10.1101/2020.10.15.341149v1.full). Both datasets have roughly the same number of sequences and are designed to train binary classifiers to predict whether a protein is a valid CPP/VP1 or not.

The following two lines perform hierarchical clustering using sequence alignments with MMseqs2.

mmseqs easy-cluster CPP_DatasetE.fasta CPP_DatasetE_out tmp

mmseqs easy-cluster mixed_vir.fasta mixed_vir.out tmp
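Each `easy-cluster` run writes (among other files) a two-column TSV mapping each cluster representative to its member sequences. As a quick sanity check on the clustering, you can count clusters with pandas; the sketch below uses a tiny in-memory example in that format rather than a real `_cluster.tsv` file, so the sequence names are illustrative:

```python
import io
import pandas as pd

# Tiny stand-in for an MMseqs2 "*_cluster.tsv" file:
# representative<TAB>member, one row per member sequence
example_tsv = io.StringIO(
    "repA\trepA\n"
    "repA\tseq1\n"
    "repA\tseq2\n"
    "repB\trepB\n"
    "repB\tseq3\n"
)

clusters = pd.read_csv(example_tsv, sep="\t", header=None,
                       names=["Cluster", "Sample"])
n_clusters = clusters["Cluster"].nunique()
sizes = clusters.groupby("Cluster").size()
print(n_clusters)       # 2
print(sizes.to_dict())  # {'repA': 3, 'repB': 2}
```

Comparing `n_clusters` between the two datasets is a useful first number to have in hand before answering the questions below.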

What does it mean for proteins to be homologous and why is homology detection important?


Homologous proteins share a common evolutionary history. Homology detection is important for understanding how protein function and stability have shifted over time. Protein engineers can take advantage of this history to predict how proteins can be modified for new functions while maintaining their stability. It is also very useful for structure prediction.


Part 2¶

After performing the clustering, you should now use TRILL to extract high-dimensional representations (embeddings) from the protein sequences using a pretrained protein language model (PLM). For this homework, you should use esm2_t6_8M, a PLM from Meta AI (https://www.science.org/doi/10.1126/science.ade2574). Below is an example of how the CPP dataset was embedded; the output will be a CSV where each row is a protein and the last column is the protein name derived from the FASTA headers. For more info on TRILL, you can check out the documentation at https://trill.readthedocs.io/en/latest/index.html

$ trill CPPs 0 embed esm2_t6_8M CPP_DatasetE.fasta --avg
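Once the embedding CSV exists, the first step in any downstream analysis is splitting it into a numeric feature matrix and the name column. The sketch below builds a toy frame of the same shape (embedding columns followed by a final label column) instead of reading the real `CPPs_esm2_t6_8M_AVG.csv`, so the values and column names here are illustrative:

```python
import pandas as pd

# Toy stand-in for a TRILL "--avg" output CSV:
# numeric embedding columns, then the FASTA-header-derived name last
df = pd.DataFrame({
    "0": [0.1, 0.4, 0.3],
    "1": [0.7, 0.2, 0.9],
    "Label": ["protA", "protB", "protC"],
})

X = df.iloc[:, :-1]     # embedding matrix, one row per protein
names = df.iloc[:, -1]  # protein names for coloring/joining later
print(X.shape)          # (3, 2)
```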

What do the clusters mean and why is there a difference in the number of clusters between the datasets, even though the datasets are designed for binary classification?


VP1s are evolutionarily related, while CPPs are broadly functionally related but not homologous. Therefore the VP1s are more likely to be clustered based on their sequences. Since the CPPs do not share an evolutionary history, their sequences are likely to be more varied, making sequence-based clustering difficult.

Part 3¶

Embed the proteins using ESM2, perform clustering (hierarchical, k-means, etc.) and then PCA to visualize the points in 2D, colored by cluster. What are potential reasons that can explain differences between MMseqs2 and ESM2 clustering, and why did you choose your clustering algorithm? A helpful pandas method is df.sample(), which lets you randomly subsample your data to speed up debugging and trying different algorithms.
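As a minimal end-to-end sketch of this step (on synthetic blobs rather than real ESM2 embeddings), clustering followed by PCA can look like the following; `n_clusters=2` mirrors the binary design of the datasets:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for protein embeddings
X = np.vstack([rng.normal(0.0, 1.0, (50, 16)),
               rng.normal(5.0, 1.0, (50, 16))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)  # 2D points for plotting

print(coords.shape)      # (100, 2)
print(len(set(labels)))  # 2
```

From here, `coords[:, 0]` and `coords[:, 1]` can be scattered with any plotting library, colored by `labels`.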


MMseqs2 produces many more clusters than clustering the embedded datasets does. On the embedded datasets, we enforced a maximum of 2 clusters; MMseqs2 does not have this constraint. We set the maximum number of clusters to 2 because we are clustering a binary dataset (e.g., the proteins are either CPPs or not). I would choose the k-means clustering algorithm since it linearly separates the data on the PCA plots, which seems to define the most obvious groups for clustering.

In [ ]:
## RUN BOTTOM CELL (w code) FIRST
# CPPs
cpps = pd.read_csv('CPPs_esm2_t6_8M_AVG.csv')
cpp_tsv = 'CPP_DatasetE_out_cluster.tsv'
hierarchical_clustering_and_pca(cpps, cpp_tsv)
In [ ]:
# VP1s
vp1s = pd.read_csv('VP1s_esm2_t6_8M_AVG.csv')
vp1_tsv = 'mixed_vir.out_cluster.tsv'
hierarchical_clustering_and_pca(vp1s, vp1_tsv)

Part 4¶

What are the pros and cons of different homology detection techniques?


The advantage of embedding and clustering is that one is able to get clearly defined labels (binary for our dataset). However, the embedding can impart its own biases on the dataset. In comparison, MMseqs2 clustering is based directly on sequence similarity. This makes clustering more challenging and results in many more clusters. However, it is the "truest" way to cluster proteins since no learned transformation is applied.

Part 5¶

How can different representations (sequence, structure, embedding, etc.) help in different applications? Are there situations where one type of representation is more or less useful?


For homology detection, sequence-based representations are useful because they preserve the order and likelihood of the residues in a protein family. For function detection, structural representations are most useful since structure often provides clues to the function of the protein (e.g., the orientation of catalytic residues in an active site). Embeddings are useful when the analysis of interest needs to be computed on numeric data.

Part 6¶

Perform a similar analysis for your desired group of "related" proteins, including mmseqs clustering, extracting embeddings and then clustering them. Your proteins of interest only need to share some sort of similarity, such as structure, sequence, or function. Compare the clustering results.

A good place to look for proteins is NCBI (National Center for Biotechnology Information) https://www.ncbi.nlm.nih.gov/guide/proteins/. From the website, you can download the FASTA files for your desired sequences. Use at least 50 sequences in your analysis.
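Since the assignment asks for at least 50 sequences, a quick header count on your downloaded FASTA is worth doing before any clustering. This minimal parser (a sketch, shown on an inline example rather than a real NCBI download) counts records by counting '>' header lines:

```python
import io

def count_fasta_records(handle):
    """Count FASTA records by counting '>' header lines."""
    return sum(1 for line in handle if line.startswith(">"))

example = io.StringIO(">sp|P00001|one\nMKVL\n>sp|P00002|two\nMRTE\n")
print(count_fasta_records(example))  # 2
```

For a real file, pass an open file handle instead of the `io.StringIO` example.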

Below are code examples that you can use for inspiration.

The clustering is much clearer for the CY153 protein family dataset. This is because these proteins (which are known to hydroxylate alkanes) are closely evolutionarily related. The MMseqs2 clustering appears the best, whereas the k-means and hierarchical clustering group distantly related proteins together (in green). On this dataset, MMseqs2 clustering outperforms clustering the embedded data, which is the opposite of the previous examples.

In [ ]:
# CY153 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3147008/)
CY153s = pd.read_csv('CY153_esm2_t6_8M_AVG.csv')
CY153s.Label = CY153s.Label.apply(lambda x: x.split('|')[1])
CY153_tsv = 'CY153.out_cluster.tsv'
hierarchical_clustering_and_pca(CY153s, CY153_tsv)
In [ ]:
import scipy.cluster.hierarchy

def hierarchical_clustering(data, method='ward'):
    """
    Perform hierarchical clustering on a given pandas dataframe and return both the linkage
    matrix and the dendrogram data.

    Parameters:
    data (pd.DataFrame): The input dataset to be clustered. Each row corresponds 
                         to an observation and each column corresponds to a feature.
    method (str): The method to be used for clustering. This parameter is optional 
                  and defaults to 'ward'.

    Returns:
    tuple: A tuple containing:
        - clustering_output (ndarray): The linkage matrix representing the hierarchical 
                                       clustering. The matrix contains the hierarchical 
                                       clustering encoded as a linkage matrix.
        - dendro_data (dict): A dictionary containing the data required to reconstruct 
                              the dendrogram. This can be used for further analysis or 
                              visualizations.
    """
    output = scipy.cluster.hierarchy.linkage(data, method)
    dendro_data = scipy.cluster.hierarchy.dendrogram(output, no_plot=True)
    return output, dendro_data
In [ ]:
def extract_mmseqs_clusters(tsv_path):
    """
    Parse a TSV file generated by MMseqs2 to create a mapping of clusters to their member sequences.

    The TSV file should contain two columns: 'Cluster' and 'Sample', where 'Cluster' represents the cluster
    identifier and 'Sample' represents the member sequence identifier. The function will read the TSV file and
    create a dictionary with each cluster as a key and a list of member sequence identifiers as values.

    Parameters:
    tsv_path (str): The file path to the TSV file to be parsed.
    
    Returns:
    dict: A dictionary with cluster identifiers as keys and lists of member sequence identifiers as values.
    """    
In [ ]:
import pandas as pd
import sklearn.decomposition

def perform_pca(df):
    """
    Perform Principal Component Analysis (PCA) on a given DataFrame.

    This function takes a DataFrame with numerical features and applies PCA to reduce the dimensionality to two principal components. 
    The result is a new DataFrame containing the first two principal components labeled as 'PC1' and 'PC2'.

    Parameters:
    df (pandas.DataFrame): A pandas DataFrame with numerical features to perform PCA on.

    Returns:
    pandas.DataFrame: A DataFrame with two columns corresponding to the first two principal components of the input data.
    """
    output = sklearn.decomposition.PCA(n_components=2).fit_transform(df)
    return pd.DataFrame(output, columns=['PC1', 'PC2'])

You may refactor the code below for the purposes of completing this homework problem.

Cell-Penetrating Peptide example plots¶

In [ ]:
# clustering & plotting

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.decomposition import PCA
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, Range1d, LabelSet, Legend, LegendItem, Div
from bokeh.layouts import gridplot, column
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from bokeh.core.properties import value
import matplotlib
import matplotlib.colors as mcolors
from collections import defaultdict
from matplotlib.colors import rgb2hex, colorConverter

output_notebook()


def get_cluster_classes(dendro_data, Z, color_threshold):
    # Find the color threshold
    max_distance = max([d[2] for d in dendro_data['dcoord']])
    threshold = color_threshold * max_distance

    # Determine clusters using fcluster by cutting the dendrogram at the specified threshold
    cluster_ids = fcluster(Z, threshold, criterion='distance')

    # Create a dictionary mapping each original label to its cluster id
    cluster_classes = {dendro_data['ivl'][i]: cluster_id for i, cluster_id in enumerate(cluster_ids)}

    return cluster_classes

def add_labels_to_dendrogram(dendro_data, plot_figure, min_y):
    # Get the positions of the leaves
    leaf_positions = get_leaf_positions(dendro_data)

    # Generate the label texts, changing to "VP1" if "@" is present
    # this is to make the dendrogram a bit easier to visually parse
    label_texts = ["VP1" if "@" in label else label for label in leaf_positions.keys()]

    # Create ColumnDataSource with the correct coordinates and label texts
    source = ColumnDataSource(data=dict(
        x=list(leaf_positions.values()),
        y=[min_y] * len(leaf_positions),
        labels=label_texts,
    ))

    # Create and add the LabelSet to the plot
    labels = LabelSet(
        x='x', y='y', text='labels', source=source,
        text_font_size='8pt', angle=90, angle_units='deg',
        text_align='center', text_baseline='middle',
    )
    plot_figure.add_layout(labels)


def get_leaf_positions(dendro_data):
    """
    Given dendrogram data, calculate the positions of the leaves
    """
    ivl = dendro_data['ivl']
    n_leaves = len(ivl)
    # Leaves are evenly spaced along the x-axis
    leaf_positions = {label: (10 * (i + 1) - 5) for i, label in enumerate(ivl)}
    return leaf_positions

def add_cluster_classes(plot_figure, dendrogram_data, cluster_classes):
    """
    Add cluster class information to the dendrogram
    """
    leaf_positions = get_leaf_positions(dendrogram_data)
    for label, x_position in leaf_positions.items():
        cluster_class = cluster_classes.get(label, '')
        # Adding labels to dendrogram
        plot_figure.text(x=[x_position], y=[0], text=[cluster_class],
                         text_font_size='8pt', text_align='center')


# This function is much more complicated than it needs to be for demonstration purposes
# Your (the student) function can be much simpler, as long as your desired conclusion is
# still able to be drawn from the plot or potentially descriptive statistics, counts etc.
# All of the html hardcoded into the function is unnecessary for actually doing the homework,
# it's just for demonstration.
def create_interactive_plot(pc_df, cluster_column, title, color_column):
    # Fixed height for the plot area
    plot_height = 400
    p = figure(width=800, height=plot_height, title=title)

    # Initialize an empty string for the legend HTML
    legend_html = "<div style='display: table; width: 100%; table-layout: fixed;'>"

    for i, category in enumerate(pc_df[cluster_column].unique()):
        category_df = pc_df[pc_df[cluster_column] == category]
        category_source = ColumnDataSource(category_df)
        color = category_df[color_column].iloc[0]
        # Truncate the category name to the first 15 characters for the legend label
        legend_label = (str(category)[:15] + '...') if len(str(category)) > 20 else str(category)
        p.scatter('PC1', 'PC2', source=category_source, size=10, fill_alpha=0.6,
                  line_color="white", color=color, legend_label=legend_label)
        
        # Append to the legend HTML
        cell_style = "style='display: table-cell; text-align: center; vertical-align: middle;'"
        color_span = f"<span style='background:{color};'>&nbsp;&nbsp;&nbsp;&nbsp;</span>"
        legend_html += f"<div {cell_style}>{color_span} {legend_label}</div>"
        
        # Close the row in the legend table and start a new one every 5 items
        if (i + 1) % 5 == 0:
            legend_html += "</div><div style='display: table; width: 100%; table-layout: fixed;'>"

    # Close the legend HTML
    legend_html += "</div>"

    # Create a Div using the legend HTML
    legend_div = Div(text=legend_html, width=800, height=100, style={'overflow':'auto'})
    p.legend.visible = False
    hover = HoverTool()
    hover.tooltips = [("Label", "@label")]
    p.add_tools(hover)
    # Combine plot and legend Div in a column layout
    combined_layout = column(p, legend_div, sizing_mode='scale_width')

    return combined_layout

# Again, this function could be refactored to be much more elegant
def hierarchical_clustering_and_pca(data, tsv_path, color_threshold=None):
    # Exclude the last column (labels) for clustering and PCA
    X = data.iloc[:, :-1]
    original_labels = data.iloc[:, -1].astype(str).values 
    # Perform hierarchical clustering
    Z = linkage(X, method='ward')

    if color_threshold is None:
        color_threshold = 0.7 * np.max(Z[:, 2])
    
    # This is the color map from the dendrogram so that the clusters are properly colored
    # You might be able to find a better way to do the coloring
    color_map = {'C1': '#ff7f0e', 'C2': '#2ca02c', 'C3': '#d62728', 'C4': '#9467bd', 'C0': '#1f77b4'}
    
    # fcluster() is used to partition the hierarchical clustering data into clusters
    # based on a user-defined criterion and t-value. In this case, we chose maxclust
    # and t=2 so that at most, there will be 2 clusters, because we know ahead of time
# the dataset has roughly two groups
    hc_labels = fcluster(Z, t=2, criterion='maxclust')
    clustered_labels = pd.DataFrame({
    'Original_Label': original_labels,
    'Cluster_Label': hc_labels
    })
    
    # If you removed no_plot=True, you could simply use matplotlib to plot the dendrogram.
    # The reason we make the dendrogram in bokeh is for easy viewing for demonstrative purposes.
    # Your dendrogram won't be the best way to visualize the clustering results, but it is still
    # useful to see, especially when comparing distances across datasets
    dendro_data = dendrogram(Z, labels=original_labels, color_threshold=color_threshold, leaf_rotation=90., no_plot=True)

    p = figure(title="Dendrogram", width=1200, height=800, x_axis_label="Sample index", y_axis_label="Distance",
               tools="pan,wheel_zoom,reset,save", toolbar_location="above", active_scroll="wheel_zoom")

    # Plot dendrogram lines
    for i, d in enumerate(dendro_data['icoord']):
        xs = d
        ys = dendro_data['dcoord'][i]
        color = color_map[dendro_data['color_list'][i]]
        p.line(xs, ys, color=color)

    # Determine the ranges for the entire dendrogram
    max_x = max([max(d) for d in dendro_data['icoord']])
    min_x = min([min(d) for d in dendro_data['icoord']])
    max_y = max([max(d) for d in dendro_data['dcoord']])
    min_y = min([min(d) for d in dendro_data['dcoord']])

    p.x_range = Range1d(start=min_x, end=max_x)
    p.y_range = Range1d(start=min_y, end=max_y)

    # Add leaf labels beneath the dendrogram tips
    add_labels_to_dendrogram(dendro_data, p, min_y)

    show(p)

    # Perform PCA
    pca = PCA(n_components=2)
    principal_components = pca.fit_transform(X)
    pc_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
    
    # Add original labels to pc_df
    pc_df['label'] = original_labels

    # Add 'hc_color' column to pc_df based on hierarchical clustering labels
    pc_df['hc_cluster'] = ['C' + str(label) for label in hc_labels]  # Map labels to 'C1', 'C2', etc.
    pc_df['hc_color'] = pc_df['hc_cluster'].map(color_map).fillna('gray')  # Apply color map

    # Load the TSV file
    tsv_clusters = pd.read_csv(tsv_path, sep='\t', header=None)
    tsv_clusters.columns = ['Cluster', 'Sample']

    # Map samples to clusters from TSV file
    tsv_cluster_map = dict(zip(tsv_clusters['Sample'], tsv_clusters['Cluster']))
    
    tsv_cluster_reps = tsv_cluster_map.keys()
    
    # Generating a colormap from tab20, which has 20 colors, so that the coloring
    # simply loops around, since the example datasets have many clusters when
    # mmseqs2 clustering is performed.
    unique_clusters = list(set(tsv_cluster_map.values()))  # Get unique clusters from TSV mapping
    num_unique_clusters = len(unique_clusters)
    tab20_cmap = matplotlib.cm.get_cmap('tab20')
    tab20_colors_hex = [matplotlib.colors.to_hex(tab20_cmap(i/num_unique_clusters)) for i in range(num_unique_clusters)]
    color_map = {cluster: tab20_colors_hex[i % num_unique_clusters] for i, cluster in enumerate(unique_clusters)}


    pc_df['tsv_color'] = pc_df['label'].map(tsv_cluster_map).map(color_map)
    pc_df['tsv_cluster'] = pc_df['label'].map(tsv_cluster_map)
    pc_df['label'] = pc_df['label'].astype(str)
    clustered_labels['Original_Label'] = clustered_labels['Original_Label'].astype(str)

    # Merge the DataFrames on the label columns
#     pc_df = pc_df.merge(clustered_labels, left_on='label', right_on='Original_Label')
    # K-means is another unsupervised form of clustering
    kmeans = KMeans(n_clusters=2, n_init=10)
    kmeans_labels = kmeans.fit_predict(principal_components)
    pc_df['kmeans_cluster'] = kmeans_labels
    pc_df['kmeans_color'] = ['#2ca02c' if label == 1 else '#ff7f0e' for label in kmeans_labels]


#     pc_df.drop(columns=['Original_Label'], inplace=True)
    plot1 = create_interactive_plot(pc_df, 'tsv_cluster', "PCA of ESM2 embeddings colored by MMseqs2 sequence clustering", 'tsv_color')
    plot2 = create_interactive_plot(pc_df, 'hc_cluster', "PCA of ESM2 embeddings colored by hierarchical clustering (maxclust=2)", 'hc_color')
    plot3 = create_interactive_plot(pc_df, 'kmeans_cluster', "PCA of ESM2 embeddings colored by K-Means Clustering (K=2)", 'kmeans_color')

    # Arrange all plots in a grid
    plots_grid = gridplot([[plot1, plot2], [plot3]])

    show(plots_grid)